Introduction to Machine Learning
Introduction
In this chapter, we explore the concept of machine learning, a branch of artificial intelligence (AI) that enables computers to learn from data and make predictions or decisions without being explicitly or directly instructed (in the “if…then…” sense). While the term may sound technical, machine learning is already all around us: from the way streaming platforms recommend movies, to how spam filters detect unwanted emails, to the personalized ads we see online, and even to how financial institutions detect fraudulent transactions. These technologies help us solve problems, uncover patterns, and improve decision-making in a wide range of fields.
To illustrate, consider a city government trying to improve traffic flow and reduce congestion. Instead of relying solely on manual traffic counts or past assumptions about busy hours, the city can collect data from sensors, GPS devices, and traffic cameras. By applying machine learning to these data, officials can identify patterns in vehicle movement, predict traffic build-ups before they happen, and adjust traffic light timings in real time. They can even simulate the impact of road closures or new bike lanes before implementing changes. In short, machine learning transforms raw traffic data into actionable insights that support smarter, data-driven urban planning.
These examples highlight the power of machine learning to turn complex, messy data into useful insights. But how does this actually work? And how does machine learning differ from other data-driven approaches like traditional statistics or data mining? To understand this, we need to take a closer look at what it means for a machine to “learn” from data and how this learning process unfolds in practice.
Defining Machine Learning
At its core, machine learning is about developing computer algorithms that can transform raw data into insights or actions (Mitchell, 1997). Rather than being told exactly what to do in every given situation, these algorithms learn patterns from past examples and use them to make predictions on new, yet unseen data. In contrast to traditional programming, where every rule must be hard-coded, machine learning systems are designed to adapt and improve as they are exposed to more data.
An intuitive way to think about this is through analogy: teaching a child to recognize fruits. We don’t give the child a detailed list of rules for every possible fruit or shape. Instead, we show examples—this is an apple, this is a banana—and let the child observe the patterns. Over time, the child learns to identify new fruits correctly, even if they’ve never seen that exact variety before. In a similar way, we “train” a machine by feeding it examples from past data and letting it learn the relationships between inputs and outcomes. The final product of this process is a machine learning model, a system that can make informed predictions or decisions when presented with new, yet unfamiliar data.
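The contrast with traditional programming can be made concrete with a deliberately tiny sketch in Python. The fruit weights, the hand-picked threshold, and the "midpoint between class means" rule are all invented for illustration; real machine learning methods estimate far richer patterns, but the principle is the same: the decision rule is derived from examples rather than written by hand.

```python
# A toy contrast between a hand-written rule and a rule learned from data.
# The fruit weights (in grams) below are invented for illustration.
examples = [(120, "apple"), (130, "apple"), (150, "apple"),
            (105, "banana"), (110, "banana"), (115, "banana")]

# Traditional programming: a human picks the threshold 118 by hand.
def hard_coded(weight):
    return "apple" if weight > 118 else "banana"

# Machine learning (a minimal version): estimate the threshold from the
# labeled examples as the midpoint between the two class means.
apple_w = [w for w, label in examples if label == "apple"]
banana_w = [w for w, label in examples if label == "banana"]
threshold = (sum(apple_w) / len(apple_w) + sum(banana_w) / len(banana_w)) / 2

def learned(weight):
    # The "model" here is just the learned threshold.
    return "apple" if weight > threshold else "banana"
```

If new examples were added, `threshold` would update automatically, whereas `hard_coded` would have to be rewritten by its programmer.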
This learning-based approach distinguishes machine learning from other data-driven disciplines such as traditional statistics and data mining. While there is overlap among them, their goals and methods differ:
Traditional statistics often focuses on explaining relationships between variables and testing hypotheses, typically under strict assumptions about the data. For example, a statistical analysis might ask, “Is there a significant relationship between income level and life expectancy?” and use well-defined models to test that relationship with precision and interpretability in mind. Machine learning, by contrast, prioritizes prediction and scalability over formal inference. It may not always provide a transparent explanation of how inputs relate to outputs, but it excels at finding complex patterns in large datasets and making accurate forecasts. For instance, a machine learning model might be used to predict which customers are likely to cancel a subscription, even if we don’t fully understand all the reasons why.
Data mining is a related concept that lies somewhere between the two. It emphasizes the discovery of useful patterns or anomalies within large datasets, often for exploratory purposes, and it frequently employs machine learning techniques to sift through data and uncover hidden relationships or clusters. While data mining typically involves machine learning, the reverse is not always true: machine learning encompasses a broader set of methods that do not necessarily include data mining tasks.
Despite these differences, all three—machine learning, statistics, and data mining—rely on a common foundation: data. They each process information to extract meaning, but they do so with different questions in mind and different tools at hand. In the context of machine learning, the focus is on building systems that can generalize from past data and apply what they’ve learned to new situations, continuously improving over time.
Benefits and Limits of Machine Learning
Machine learning has rapidly transformed how we interact with technology, make decisions, and solve complex problems. From self-driving cars to real-time translation, its influence is profound. But while machine learning opens new doors, it also introduces risks and responsibilities.
One of the major benefits of machine learning is its ability to bring efficiency and automation to tasks that would otherwise be time-consuming, repetitive, or difficult for humans to perform at scale. In everyday life, email filtering systems use machine learning to detect and block spam messages. In manufacturing, predictive maintenance systems analyze sensor data from machines to predict failures before they occur, reducing downtime and repair costs. In healthcare, algorithms can scan medical images to detect diseases such as cancer, sometimes with greater speed and accuracy than human specialists. These examples demonstrate that machine learning not only speeds up processes but also ensures consistent performance (and even lower cost), as machines do not suffer from fatigue or forgetfulness and can work continuously.
Another significant advantage of machine learning is personalization. By learning from user behavior, algorithms can tailor content, products, or services to individual preferences. Streaming platforms, such as Netflix or Spotify, analyze what users watch or listen to and use that data to recommend shows or songs that align with their tastes. E-commerce websites make use of similar data to suggest products a customer might be interested in, often increasing engagement and sales. Online education platforms can adapt the pace and difficulty of lessons based on each student’s progress, thereby enhancing the learning experience. These personalized systems rely on machine learning’s capacity to detect patterns in large datasets and apply them in real time.
Machine learning also excels at making predictions. Organizations across many sectors rely on it to anticipate future outcomes based on historical data. For instance, retail companies use it to forecast product demand, helping them manage inventory more effectively. In finance, banks deploy machine learning models to estimate the likelihood that a customer will repay a loan. Meteorologists improve the accuracy of weather forecasts using machine learning systems trained on decades of data. These predictive capabilities help institutions make more informed, data-driven decisions that can lead to cost savings, improved planning, and reduced risk.
Beyond prediction, machine learning can uncover patterns in data that are too subtle or complex for humans to detect on their own. In biology, for example, clustering algorithms group genes with similar behavior, assisting researchers in understanding the roles of different genes in various cellular processes. In marketing, machine learning helps businesses identify distinct customer segments with similar behaviors or preferences, enabling more targeted and effective campaigns. These insights are particularly valuable when dealing with large, unstructured datasets where traditional methods fall short.
Once a model has been trained, machine learning systems are also highly scalable. That means they can apply what they’ve learned across millions of data points with little extra cost. For example, a fraud detection algorithm trained on one set of bank transaction data can be used to monitor thousands of transactions per second. A voice recognition system developed for one language can be extended or adapted to work with many others. This scalability makes machine learning especially powerful in digital environments where volume and speed are crucial.
Despite these impressive capabilities, machine learning also has important limitations and ethical considerations. A major concern is bias and discrimination. Machine learning systems learn from historical data, and if that data contains biases, the model will likely replicate or even amplify them. A well-known case involved a large tech company that trained a resume screening model on past hiring data. Because their historical workforce was overwhelmingly male, the model learned to favor resumes that mirrored that profile and penalized indicators associated with women, such as participation in women’s clubs or graduation from all-women colleges. In healthcare, some predictive models used to allocate resources have systematically underestimated the severity of illness in Black patients because they relied on healthcare spending as a proxy for need, overlooking the fact that marginalized groups often have less access to care. These examples highlight how biased inputs lead to biased outcomes, and worse, they can give the illusion of objectivity, making unfair systems appear legitimate.
Privacy is another critical issue. Many machine learning applications depend on vast amounts of personal data. While users may enjoy more relevant ads or services, they are often unaware of how much data is being collected or how it is used. Social media platforms track our likes, shares, and clicks to tailor content and advertisements. Smart home devices sometimes record conversations without users’ knowledge and store this audio data on cloud servers. This raises uncomfortable questions about surveillance, consent, and the line between personalization and intrusion. As machine learning becomes more embedded in our lives, maintaining privacy and ensuring users understand what they’re agreeing to becomes ever more important.
Transparency is also a challenge. Many advanced machine learning models operate as “black boxes”, meaning they make accurate predictions, but it’s hard to understand how or why. This becomes a problem in high-stakes decisions. For instance, if a person is denied a loan based on a model’s prediction, they may want to know the reason. But with complex models, that explanation may not be readily available. In criminal justice, some jurisdictions have used machine learning tools to predict whether someone is likely to re-offend. These tools can affect sentencing or parole decisions, yet the models are often proprietary and difficult to scrutinize, raising questions about due process and accountability.
Moreover, machine learning can be used to generate fake or misleading content. Deepfake technology, which uses AI to create realistic but fake videos of people, poses a serious threat to public trust. A deepfake of a political leader making false statements could sow confusion or panic. Similarly, AI-generated text can be used to flood the internet with fake news or misinformation. These tools can be used maliciously, eroding trust in information sources, influencing elections, or damaging reputations.
There are also economic implications. Machine learning-driven automation can displace human workers, particularly in roles that involve repetitive or routine tasks. In logistics, self-driving trucks threaten the jobs of professional drivers. In customer service, chatbots increasingly handle queries that once required human agents. Even in journalism, algorithms can now write basic articles about sports scores or financial updates. Without coordinated efforts in retraining and education, this shift could exacerbate inequality, benefiting those with the skills to work alongside machines while leaving others behind.
In light of all this, it’s clear that machine learning is a powerful tool, which can be used to enhance our capabilities and improve decision-making, but also one that must be handled with care. Its value depends on how it’s developed and deployed. Machine learning systems should be built thoughtfully, with attention to fairness, transparency, and accountability. This includes careful selection and documentation of training data, regular monitoring for harmful effects, and clear communication about how decisions are made. It also means involving a diverse range of voices in discussions about how these technologies are designed and used, so that their benefits are widely shared and their risks are responsibly managed.
Data Structures for Machine Learning
In machine learning, data typically comes in two main forms: structured and unstructured. Understanding the structure of your data is important because it influences the kind of models you can use, the pre-processing steps you need, and even the performance of your final solution.
Structured data is organized into a clear and consistent format, most commonly a table or (in R) a data frame. In this format:
Rows represent individual observations, examples, or data points—each row corresponds to a single case or record.
Columns represent features, variables, or predictors—each column holds a specific type of information measured across all observations.
Most classical machine learning algorithms are designed to work with structured data. That’s why this format is the starting point for the majority of real-world machine learning workflows. As Hadley Wickham emphasizes in his concept of tidy data, “Each variable forms a column, each observation forms a row, and each type of observational unit forms a table.” This principle ensures that data is organized in a consistent and meaningful way, which simplifies the steps of cleaning, exploring, transforming, and modeling data. Many machine learning tools (especially in R and Python) are built around this tidy format, making it easier to move from raw data to a working model. Tidy data is not a new concept for us; we have worked with it throughout this book, which makes applying machine learning methods more straightforward.
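To make the row/column convention concrete, here is a small sketch of a tidy table built from plain Python structures (in R this would be a data frame; in Python, a dictionary of columns is essentially what a pandas DataFrame wraps). The customer numbers are invented for illustration:

```python
# A tidy, structured data set: each column is one variable, and the values
# at position i across all columns form one observation (one customer).
customers = {
    "customer_id": [1, 2, 3],
    "visits":      [12, 30, 3],
    "purchases":   [2, 9, 0],
}

# Reassemble the observations: one row (dict) per customer.
n_rows = len(customers["customer_id"])
rows = [{col: values[i] for col, values in customers.items()}
        for i in range(n_rows)]
```

Because every variable lives in its own column and every observation in its own row, cleaning, exploring, and modeling steps can all assume the same layout.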
On the other hand, unstructured data, such as text documents, images, audio, or video, does not fit naturally into rows and columns. While this type of data is often intuitive for humans to interpret, it poses great challenges for machine learning. Unstructured data must often be transformed into a structured format before it can be used effectively. For example, natural language processing might convert text into word counts, and image processing might represent pixels as arrays of numbers.
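The text-to-word-counts transformation mentioned above can be sketched in a few lines of Python. The three mini "documents" are invented for illustration; real natural language processing pipelines add many refinements (tokenization rules, stop-word removal, weighting), but the core idea is the same: unstructured text becomes a structured matrix with one row per document and one column per word.

```python
from collections import Counter

# Three tiny, made-up documents (unstructured text).
docs = ["the movie was great", "the plot was weak", "great plot great cast"]

# The vocabulary defines the columns of the structured representation.
vocab = sorted(set(word for doc in docs for word in doc.split()))

# Bag-of-words matrix: one row per document, one count per vocabulary word.
matrix = [[Counter(doc.split())[word] for word in vocab] for doc in docs]
```

Once the text is in this tabular form, any algorithm that expects rows and columns can be applied to it.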
Because of its clarity, accessibility, and direct compatibility with most algorithms, structured data serves as the foundation for many machine learning applications. It allows practitioners to quickly build, evaluate, and interpret models, especially when starting out or working on standard business or scientific problems. Even in more advanced use cases involving unstructured data, part of the workflow often includes converting that data into a structured format that algorithms can handle.
Types of Machine Learning
Machine learning can be broadly divided into Supervised Learning and Unsupervised Learning, with several other important variants that build upon or combine these foundational types.
Supervised vs Unsupervised Learning
Supervised Learning involves training models on data that includes a specific target or dependent variable. The goal of the training is for the “machine” to learn a function that maps input features to this known outcome. For example, consider an online store interested in predicting online purchases: the target (dependent) variable could be the number of purchases made within a month. The model uses historical data where this target is known, learns the relationship, and then predicts future (i.e., yet unknown) target values.
This type of machine learning may look very familiar: linear regression can itself be used for supervised learning. In Chapter Linear Regression - Statistical Inference, we explained how we use independent variables to draw statistical inferences about the dependent variable. We discuss this in more detail in the next chapter.
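A minimal supervised-learning sketch in Python, using the online store scenario: we learn the line y = a + b·x from historical (visits, purchases) pairs by ordinary least squares and then predict for a new customer. The numbers are invented for illustration:

```python
# Historical data where the target is known.
visits    = [5, 10, 15, 20, 25]   # input feature (x)
purchases = [1, 3, 4, 6, 7]       # known target (y)

# Fit y = a + b*x by ordinary least squares.
n = len(visits)
x_bar = sum(visits) / n
y_bar = sum(purchases) / n
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(visits, purchases)) / \
    sum((x - x_bar) ** 2 for x in visits)
a = y_bar - b * x_bar

def predict(new_visits):
    # Apply the learned relationship to a new, yet-unseen case.
    return a + b * new_visits
```

The training step uses only data where the target is known; prediction then applies the learned relationship to inputs whose target is unknown.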
Unsupervised Learning, by contrast, deals with data without a predefined target variable. The objective here is to uncover patterns or structure directly from the features. Using the same online store example, unsupervised learning might group customers into segments like “Loyal customers” (many visits and many purchases), “Just visitors” (many visits but few purchases), and “Transactional shoppers” (few visits and few purchases) based solely on their behavior. Because there is no explicit target, evaluating the quality of results is often more subjective and requires domain knowledge.
Supervised Learning
In supervised learning, the problems generally fall into two main categories based on the nature of the target variable:
Regression Problems: The target variable is continuous. For instance, predicting the net sales of an automobile manufacturing company for the next calendar year is a regression task. The goal is to produce predictions as close as possible to the actual values. Predicting 349.5 million when the true sales are 350 million is considered good, even if not exact.
Classification Problems: The target variable is categorical, meaning it can take one of a limited set of distinct values, typically called classes. A machine learning algorithm designed for this type of task is called a classifier, because its goal is to assign inputs to the correct class. Though in software such as R, these categories might be stored as character strings, factors, or even numbers, each value represents a different group or class. Classification problems can be:
Binary Classification: When there are only two classes, such as predicting whether a customer renews a contract or not.
Multi-class Classification: When there are more than two classes. For example, predicting whether a customer will continue with the same contract, switch contracts, or cancel altogether.
In classification, models can output either a discrete class label (e.g., “renew” vs. “cancel”) or a probability representing the likelihood of belonging to a particular class. When probabilities are predicted, the problem resembles regression but constrained within 0 and 1.
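The probability-then-label idea can be sketched briefly. Many classifiers compute an unbounded score from the inputs and squash it into (0, 1) with the logistic (sigmoid) function; a threshold then converts the probability into a discrete class. The contract-renewal labels and the example scores below are illustrative assumptions:

```python
import math

def sigmoid(score):
    # Maps any real-valued score into a probability strictly between 0 and 1.
    return 1 / (1 + math.exp(-score))

def classify(score, threshold=0.5):
    p_renew = sigmoid(score)            # probability of the "renew" class
    label = "renew" if p_renew >= threshold else "cancel"
    return label, p_renew
```

When the model reports `p_renew` itself, the task resembles regression, but with the output constrained to lie between 0 and 1.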
Unsupervised Learning
Unsupervised learning primarily involves three important tasks:
Clustering: Grouping observations into categories based on similarity. Using the online store example, customers can be clustered into groups like “Loyal customers”, “Just visitors”, or “Transactional shoppers” based on patterns in their data, without predefined labels.
Dimensionality Reduction: Simplifying datasets by reducing the number of variables while preserving meaningful information. For example, if a dataset includes both weight and Body Mass Index (BMI) for patients, these two related features can be combined or reduced to fewer features using unsupervised methods. Since BMI is derived from weight and height, keeping both weight and BMI separately might be redundant. Dimensionality reduction techniques help combine or summarize such correlated variables to improve model efficiency and interpretability.
Pattern Recognition: Identifying recurring patterns or structures within data, often used in image processing, speech recognition, and bioinformatics. It overlaps with clustering but emphasizes recognizing meaningful regularities or features within the data without labeled targets.
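To make clustering concrete, here is a tiny k-means sketch in Python on made-up customer data, in the spirit of the three segments above. The point values, the choice of k = 3, and the naive initialization are all illustrative assumptions, not a production-ready implementation:

```python
import math

# Each point: (monthly visits, monthly purchases) -- invented numbers.
points = [(30, 9), (28, 8),   # many visits, many purchases
          (29, 1), (27, 2),   # many visits, few purchases
          (3, 0),  (4, 1)]    # few visits, few purchases

# Naive initialization: pick one starting center from each apparent region.
centers = [(30, 9), (29, 1), (3, 0)]

def nearest(point, centers):
    # Index of the closest center by Euclidean distance.
    return min(range(len(centers)),
               key=lambda i: math.dist(point, centers[i]))

for _ in range(10):  # alternate assignment and center-update steps
    groups = {i: [] for i in range(len(centers))}
    for p in points:
        groups[nearest(p, centers)].append(p)
    centers = [tuple(sum(coord) / len(group) for coord in zip(*group))
               for group in groups.values()]

labels = [nearest(p, centers) for p in points]
```

No labels were given to the algorithm; the three groups emerge purely from similarity in the features, and naming them (“Loyal customers”, and so on) is left to the analyst's domain knowledge.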
Other Types of Machine Learning
Beyond the basic types of machine learning, several specialized approaches exist to tackle different challenges or make better use of available data:
Semi-Supervised Learning: This approach uses a small amount of labeled data together with a large amount of unlabeled data. The model learns from the labeled examples but also tries to find patterns in the unlabeled data to improve its understanding (Zhu & Goldberg, 2009). For example, imagine we have thousands of photos, but only a few are labeled as “cat” or “dog.” Labeling every photo is expensive and time-consuming. Semi-supervised learning helps by using those few labeled images to guide learning and then “guess” labels for the many unlabeled images. This can significantly boost accuracy compared to using only the small labeled set.
Self-Supervised Learning: Self-supervised learning lets the model generate its own training signals from the data without needing manual labels (LeCun & He, 2022). For instance, in natural language processing, a model might be trained to predict a missing word in a sentence. By doing this repeatedly across millions of sentences, the model learns the structure and meaning of language. In computer vision, the model could learn to predict parts of an image from other parts. This approach allows training on vast amounts of raw data, making it very powerful for tasks where labeled data is scarce or expensive.
Meta-Learning (Learning to Learn): Meta-learning focuses on training models that can quickly adapt to new tasks using very little data (Finn, Abbeel, & Levine, 2017). Think of it as teaching a model how to learn efficiently. For example, suppose you want a model to recognize different species of birds. Instead of training a new model from scratch every time you want it to identify a new bird species, a meta-learning model has already learned general patterns of birds. With just a few new images, it can quickly adapt and correctly identify the new species. This is especially useful in situations where data for new tasks is limited.
Reinforcement Learning: In reinforcement learning, an agent learns by interacting with its environment and receiving feedback in the form of rewards or penalties (Sutton & Barto, 2018). Unlike supervised learning, the agent doesn’t get explicit answers but must figure out the best actions through trial and error. For example, a robot learning to walk might receive a positive reward every time it takes a successful step and a penalty when it falls. Over time, by trying different movements, it discovers the best walking strategy. This approach is widely used in robotics, autonomous vehicles, and game-playing AI like AlphaGo.
Adversarial Learning: Adversarial learning involves training two models that compete against each other, pushing both to improve. A famous example is Generative Adversarial Networks (Goodfellow, Bengio, & Courville, 2016). One model, the generator, tries to create fake data (like images of people who don’t exist), while the other, the discriminator, tries to tell real data apart from fake. The generator improves by learning to fool the discriminator, and the discriminator improves by getting better at spotting fakes. This back-and-forth leads to very realistic data generation and has applications in art, image enhancement, and data augmentation.
Model Types
In machine learning, models are the core tools we use to extract patterns, relationships, and signals from data. Understanding different model types helps clarify what a model is trying to achieve and guides both the choice of algorithm and the evaluation criteria. Broadly speaking, models can be categorized into three types: descriptive, inferential, and predictive (Kuhn & Silge, 2022). These categories are not exclusive to machine learning, as they stem from broader data analysis traditions, but they remain deeply relevant, especially in supervised learning contexts where we have a clearly defined target variable. While supervised learning typically involves predictive or inferential goals (e.g., predicting an outcome or estimating effects), descriptive models are often used in unsupervised learning, where no outcome variable is available and the aim is to explore the structure of the data.
Descriptive models aim to summarize or describe the main characteristics of a dataset. Their primary purpose is to highlight trends, patterns, or artifacts within the data without necessarily making predictions about future observations. For example, a clustering algorithm that groups customers based on purchasing behavior provides descriptive insights that help understand the underlying structure of the data. Techniques like LOESS (see Chapter LOESS Regression) are often used descriptively to explore relationships in data because they reveal local trends without assuming a global functional form.
Inferential models focus on hypothesis testing and drawing conclusions about a population based on sample data. These models often involve assumptions about the data generation process, and the validity of their results heavily depends on how well these assumptions hold. Traditional statistical models such as linear regression used to test whether a certain factor significantly affects an outcome are inferential. Here, interpretability and the ability to understand the influence of specific variables are critical, and inferential statistics provide confidence intervals and significance tests to assess these effects.
Predictive models are designed to generate the most accurate possible predictions on new, yet unseen data (Kuhn et al., 2022; Hastie et al., 2009). Their success is measured by how well they generalize beyond the training examples (training sets). Predictive modeling can use either mechanistic models, which are based on known theoretical relationships (like physics-based models), or empirically driven models, which rely on observed data to uncover patterns without necessarily understanding the underlying mechanism. Machine learning algorithms such as random forests (discussed in later chapters) are examples of empirically driven predictive models.
At this point, it’s important to emphasize the difference between a model and a machine learning method. A model is the conceptual or mathematical structure that represents data patterns, while a machine learning method is the procedure or algorithm used to estimate or fit the model to data. For instance, a linear regression model can be fit using a number of methods: ordinary least squares, gradient descent, or robust regression techniques—however, the mathematical equation (the mathematical structure) would stay the same.
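The model/method distinction can be shown directly: below, the same linear model y = a + b·x is fit by two different methods, closed-form ordinary least squares and gradient descent, and both arrive at (approximately) the same coefficients. The data values and the learning-rate/iteration settings are illustrative assumptions:

```python
# One model (y = a + b*x), two estimation methods. Invented data:
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.0, 9.9]

# Method 1: ordinary least squares (closed form).
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b_ols = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
        sum((xi - x_bar) ** 2 for xi in x)
a_ols = y_bar - b_ols * x_bar

# Method 2: gradient descent on the same squared-error objective.
a_gd, b_gd = 0.0, 0.0
learning_rate = 0.01
for _ in range(20000):
    grad_a = sum(2 * (a_gd + b_gd * xi - yi) for xi, yi in zip(x, y)) / n
    grad_b = sum(2 * (a_gd + b_gd * xi - yi) * xi for xi, yi in zip(x, y)) / n
    a_gd -= learning_rate * grad_a
    b_gd -= learning_rate * grad_b
# Both methods estimate the same underlying model.
```

The mathematical structure being estimated never changes; only the procedure used to find its parameters does.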
Interestingly, the same model type can serve multiple purposes, depending on the analyst’s intention. For example, a linear regression model can be descriptive when used to summarize data trends, inferential when testing hypotheses about coefficients, or predictive when the focus is on forecasting outcomes.
Machine Learning Fundamental Concepts
Now that we have discussed the machine learning process and its types, it is important to keep in mind a number of fundamental concepts. These concepts are highly valuable, as they form the cornerstone for understanding and implementing effective algorithms in various domains.
Bias-Variance Trade-Off
The Bias-Variance Trade-Off is perhaps the most popular and important concept in machine learning. Suppose we want to create a model to predict the box office revenues of a new movie. To achieve this, we could use different predictors such as the movie’s budget, its genre, or even critics’ reviews of the movie trailers. As we train a model, though, we want it to capture the patterns in the data and not the noise. Noise is the unpredictable part of the data, while a pattern is something predictable that we hope our model will capture and use to generate accurate output. So, every data set consists of patterns, to which we want our model to pay attention, and noise, which we want our model to avoid.
On the one hand, we could ignore every predictor entirely: take a random sample of movies, calculate their mean box office revenue, and use that mean as the prediction for every new movie. As a result, we would have a very simple model (just the average), one that we would expect to perform poorly, since its prediction would probably deviate a lot from the actual box office revenues for any given movie. In this case, we would say that we have high bias because the true value would differ a lot from our prediction:
\[\text{Bias} = \text{Revenues}_{\text{true}} - \text{Revenues}_{\text{mean}}\]
This concept of bias is the same as the one we discussed in Chapter Simple Linear Regression: it represents the deviation between the true value and the estimated value.
The good thing about that model, though, is that if we take a different random sample of movies and again compute the average, we would expect our prediction to be more or less the same, which suggests that—at least—there would be some consistency in our predictions. As a result, we could argue that our predictions would have low variance, meaning that they remain stable across different samples.
On the other hand, suppose that we consider a very large data set, use a lot of different predictors and use a very advanced machine learning methodology to create a truly complex model. In this case, we would expect that our prediction would be very close to the true value, meaning that bias would be very low. However, a very complex model means that most of the patterns and noise have been captured by the model. In other words, our model may see very rare cases (e.g., movies that due to very special circumstances had an extremely high or low box office revenue) as something that is commonplace. This would mean that a slight change in the data set (e.g., removing such movies) would lead to a different model (different estimated parameters), meaning that our predictions will not be consistent. In other words, our model would have high variance.
So the first simple model would underfit the data, meaning that it has not captured many of the patterns to be found in them, while the second would overfit the data, meaning that the model has captured the patterns too well, i.e., that the model sees almost everything as a structure to be repeated in future cases, which is not necessarily the case. As a result, the predictions of a simple model are of high bias (deviate a lot from the true values) but at least they are consistent (low variance), while the predictions of a complex model are of low bias (deviate less or even slightly from true values) but they are not consistent (high variance).
The bias-variance trade-off is something we need to keep in mind: increasing model complexity can reduce bias but often increases variance, so there is a trade-off between accuracy and stability that we must consider when building machine learning models (Hastie et al., 2009). Additionally, although complex models usually have lower bias than simple ones, there can be situations in which simple models outperform complex models while having lower variance.
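The trade-off described above can be demonstrated with a small simulation in Python. We repeatedly draw training samples from an invented "true" relationship (y = 3x plus noise), fit two models each time, and compare how their predictions at a fixed point behave. The mean-only model and the 1-nearest-neighbour model stand in for the "too simple" and "too flexible" extremes; all numbers are illustrative assumptions:

```python
import random

random.seed(42)

# Invented ground truth: y = 3x plus Gaussian noise.
def sample_data(n=20):
    xs = [random.uniform(0, 10) for _ in range(n)]
    return [(x, 3 * x + random.gauss(0, 4)) for x in xs]

TRUE_VALUE = 27  # the true y at the query point x = 9

mean_preds, knn_preds = [], []
for _ in range(200):  # many different random training samples
    data = sample_data()
    # Simple model: ignore x, always predict the average y (high bias).
    mean_preds.append(sum(y for _, y in data) / len(data))
    # Flexible model: copy the closest training point (high variance).
    knn_preds.append(min(data, key=lambda p: abs(p[0] - 9))[1])

def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

bias_mean = abs(sum(mean_preds) / len(mean_preds) - TRUE_VALUE)  # large
bias_knn = abs(sum(knn_preds) / len(knn_preds) - TRUE_VALUE)     # small
```

Across the 200 resamples, the simple model is badly biased but its predictions barely move, while the flexible model is nearly unbiased but its predictions swing widely from sample to sample.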
No Free Lunch Theorem
A related but distinct concept is the No Free Lunch Theorem. This theorem states that there is no algorithm that works best for every problem. Put differently, there is no universal solution or “free lunch”: no single algorithm will always outperform the others across all possible data sets, even though almost any model will capture some patterns in the data, regardless of the technique.
This idea implies that the effectiveness of different machine learning algorithms depends on the specific characteristics of the problem at hand (Wolpert, 1996). Respectively, when we work on a machine learning problem, it is important to consider factors like the nature of the data, the complexity of the problem, and the specific goals that we have in order to choose the most appropriate algorithm or combination of algorithms. This is also one of the reasons why exploratory data analysis is so important before we start diving into the predictive modeling process.
The Curse of Dimensionality
The Curse of Dimensionality refers to the various challenges that arise when working with high-dimensional data in data analysis and machine learning (Bellman, 1961). Having many variables (columns in a data frame) is not always the best approach, despite the conventional intuition that more data leads to better outcomes: more variables do not necessarily mean better data.
As the number of variables (or dimensions) in the data set increases, a number of problems can emerge such as:
Increased Sparsity: In high-dimensional spaces, data points tend to become more sparse, meaning there’s less data relative to the overall space. This can make it harder to find meaningful patterns or relationships in the data.
Increased Computational Complexity: Many algorithms become computationally intensive as the dimensionality of the data increases. For example, distance-based algorithms like K-nearest neighbours (discussed in a later chapter) become less efficient because computing distances in high-dimensional spaces grows increasingly costly.
Overfitting: With high-dimensional data, there’s a risk of overfitting, where a model learns to capture noise or random fluctuations in the data rather than true underlying patterns (Hastie et al., 2009). It is important to remember here that noise refers to the unpredictable part of data that can mislead a model. Overfitting—as discussed above—means the model pays too much attention to noise, harming its ability to draw generalizable conclusions.
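The sparsity effect can be demonstrated directly. A well-known symptom is that in high dimensions, pairwise distances between random points concentrate around a single value, so “near” and “far” lose their meaning for distance-based methods. The sketch below (with made-up sample sizes and dimensions) compares how much distances vary in 2 dimensions versus 500:

```python
import numpy as np

rng = np.random.default_rng(42)

def relative_spread(dim, n=200):
    """Draw n random points in the unit hypercube and measure how much
    their pairwise distances vary relative to the mean distance."""
    pts = rng.random((n, dim))
    sq = (pts ** 2).sum(axis=1)
    # Squared pairwise distances via the Gram matrix (avoids an n*n*dim array).
    d2 = sq[:, None] + sq[None, :] - 2.0 * pts @ pts.T
    iu = np.triu_indices(n, k=1)
    d = np.sqrt(np.maximum(d2[iu], 0.0))
    return float(d.std() / d.mean())

low = relative_spread(2)     # distances vary widely in 2-D
high = relative_spread(500)  # distances concentrate in 500-D
print(f"relative spread, 2 dimensions:   {low:.3f}")
print(f"relative spread, 500 dimensions: {high:.3f}")
```

In 500 dimensions the relative spread collapses: every point is roughly the same distance from every other point, which is exactly why nearest-neighbour comparisons become less informative as dimensionality grows.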
Overall, the curse of dimensionality highlights the importance of carefully selecting and transforming variables, as well as using dimensionality reduction techniques to mitigate the challenges associated with high-dimensional data in machine learning tasks.
Parsimony Principle
Generally, more complex models tend to offer better performance. However, due to their complexity, these models are often considered ‘black-box’ models, meaning the patterns they capture are difficult to grasp and interpret. This creates a trade-off between model complexity and interpretability. That said, there are cases where simpler models can perform just as well, or even outperform more complex ones, with only a slight difference in accuracy.
For example, a very complex linear regression model, with many polynomial and interaction terms, might achieve 86% accuracy on a test set, while a simple linear regression might reach 84%. Even though the complex model shows slightly higher accuracy, the difference may not be practically significant (whether these 2 percentage points matter depends on the case). The key takeaway here is the principle of parsimony, or “Ockham’s Razor”: when two models have similar performance, it is often best to choose the simpler one.
Machine Learning in Practice
While the theoretical foundations of machine learning are essential, it’s in practical applications that the process truly comes to life. At a high level, most machine learning projects move through a set of stages, from understanding the data to refining the model, with the ultimate goal of building systems that perform reliably on new, unseen data. Although these stages are presented below in order for clarity, in practice, the process rarely follows a straight, linear path. Instead, machine learning usually involves a series of iterative and interdependent steps, where insights from later stages often lead us to revisit earlier ones (Lantz, 2022).
1. Data Collection: Every machine learning project begins with data. The relevance, quality, and quantity of the data you collect set the stage for everything that follows. Depending on the problem, data may come from sensors, surveys, transaction logs, APIs, web scraping, or other digital sources. Without meaningful data, there is no foundation for learning.
2. Data Importation and Exploration: Once the data is collected, it must be imported and explored. This phase involves exploratory data analysis to gain an intuitive as well as a statistical understanding of the structure of the data. We look at the distribution of the variables, check for missing or inconsistent values, and explore relationships between those variables. At this stage, we also begin to define the modelling goal more concretely. For example, are we trying to predict the total revenues, or whether the revenues exceed a particular threshold? This clarity helps guide future modelling choices.
3. Data Cleaning and Pre-processing: Raw data is rarely ready for modelling. Data cleaning addresses problems such as missing values, incorrect data types, or outliers. Pre-processing may include normalizing numerical variables or removing redundant features. The goal is to make the data suitable for use.
4. Feature Engineering: Feature engineering is the process of creating or transforming variables to better expose the underlying structure in the data. Sometimes the original features are insufficient to capture important relationships, so we might calculate ratios, combine variables, or use domain knowledge to create more informative inputs. Good feature engineering can dramatically improve a model’s performance (Kuhn et al., 2013).
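A minimal sketch of what this looks like in practice, using made-up store data: a ratio of two raw columns and a simple domain-knowledge flag can each carry more signal than the original inputs (the variable names and the traffic threshold are hypothetical):

```python
import numpy as np

# Hypothetical raw features for a handful of stores (made-up numbers).
revenue = np.array([120.0, 300.0, 90.0, 450.0])
visitors = np.array([40.0, 60.0, 45.0, 90.0])

# Engineered feature 1: a ratio often exposes structure that neither
# raw column shows on its own, e.g. revenue per visitor.
revenue_per_visitor = revenue / visitors

# Engineered feature 2: a domain-knowledge flag (threshold is assumed).
high_traffic = (visitors > 50).astype(int)

print(revenue_per_visitor)  # [3. 5. 2. 5.]
print(high_traffic)         # [0 1 0 1]
```

Both new columns would then be fed to the model alongside (or instead of) the originals; whether they help is an empirical question answered in the evaluation stage.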
5. Model Training: This is the stage where the machine learning algorithm learns from the data. Based on the type of model, the training algorithm uses the input features and outcome variable to adjust internal parameters so the model can represent useful patterns. These internal parameters can be seen as settings, and different settings can lead to very different model behavior, even if the model structure stays the same. The dataset that is used exclusively to “teach” the model is called the training set.
6. Model Evaluation: After training, the model is tested on a separate subset of the data, called the test set, to evaluate how well it performs on new, unseen cases. Performance can be assessed using different metrics depending on the problem (we discuss these metrics in later chapters). This evaluation helps determine whether the model is overfitting (in effect, “memorizing”) the training data or truly generalizing.
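Steps 5 and 6 can be sketched together in a few lines. The example below uses made-up data with a known linear relationship, holds out 25% of the rows as a test set, learns the slope and intercept from the training rows only, and then scores the fit on the held-out rows (the split ratio and data-generating process are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: the outcome depends linearly on one feature, plus noise.
x = rng.random(100)
y = 3.0 * x + rng.normal(0.0, 0.1, size=100)

# Hold out 25% of the rows as a test set the model never sees in training.
idx = rng.permutation(100)
train_idx, test_idx = idx[:75], idx[75:]

# "Training": learn the parameters (slope, intercept) from training rows only.
slope, intercept = np.polyfit(x[train_idx], y[train_idx], 1)

# "Evaluation": measure error on the held-out test rows.
test_preds = slope * x[test_idx] + intercept
test_mse = float(np.mean((test_preds - y[test_idx]) ** 2))
print(f"learned slope: {slope:.2f}, test MSE: {test_mse:.4f}")
```

Because the test rows played no part in fitting, the test MSE is an honest estimate of how the model would behave on genuinely new data; a test error far above the training error would be the warning sign of overfitting.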
7. Model Tuning and Selection: Most machine learning models come with hyperparameters—settings that govern the learning process itself. These are different from the parameters the model learns during training. Parameters (such as the slope and intercept in linear regression) are learned from the training data and define the model’s behavior. Hyperparameters, on the other hand, are chosen before training and influence how the model learns (such as how fast it learns, or how complex it can become).
Tuning hyperparameters involves systematically exploring different hyperparameter combinations to find the configuration that yields the best performance. The goal is to balance the model’s ability to capture patterns (reduce bias) while avoiding overfitting (keep variance in check). Proper tuning can make a big difference: the same model trained with poor hyperparameter choices might perform worse than a simpler model, while the right hyperparameters can unlock the full potential of a complex algorithm.
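A tiny grid search makes the distinction concrete. In the sketch below the polynomial degree is the hyperparameter (chosen before training), while the coefficients of each polynomial are the parameters (learned during training); the data set, the candidate degrees, and the train/validation split are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical data: a quadratic signal with a little noise.
x = np.sort(rng.random(120))
y = 1.0 - 4.0 * (x - 0.5) ** 2 + rng.normal(0.0, 0.05, size=120)

# Split the rows into training and validation sets.
idx = rng.permutation(120)
train, val = idx[:80], idx[80:]

def val_mse(degree):
    """Fit on the training rows, then score on the validation rows."""
    model = np.polynomial.Polynomial.fit(x[train], y[train], degree)
    return float(np.mean((model(x[val]) - y[val]) ** 2))

# The grid search: try each hyperparameter value, keep the best performer.
grid = [1, 2, 3, 5, 9]
scores = {d: val_mse(d) for d in grid}
best_degree = min(scores, key=scores.get)
print({d: round(s, 4) for d, s in scores.items()})
print("chosen degree:", best_degree)
```

Degree 1 underfits the curved signal badly (high bias), so the search rejects it; the chosen degree balances fitting the signal against chasing validation noise, which is exactly the bias-variance balancing act described above.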
8. Model Improvement and Deployment: Even after a model performs well, there is often room for refinement. This might include gathering more data, engineering better features, or adjusting preprocessing steps. Eventually, when a model is considered stable, it can be deployed to make real-world decisions—whether that means recommending a product, flagging fraudulent transactions, or forecasting demand. Once deployed, the model’s performance should continue to be monitored, and the model updated as needed.
As mentioned above, the practice of machine learning is not strictly sequential. The steps above often overlap, and the process is highly iterative. For example, during data exploration we might realize that we lack critical variables, prompting a return to the data collection phase. Likewise, errors during model training can reveal issues in pre-processing that were previously overlooked. This back-and-forth movement between stages is normal and reflects the reality of working with complex, imperfect data. Maintaining flexibility is essential, allowing the workflow to evolve in response to insights uncovered during data exploration and model development.
Recap
In this chapter, we introduced the key concepts and foundations of machine learning. We discussed what machine learning is, how it differs from statistics and data mining, the types of learning (supervised, unsupervised, and others), the kinds of models we use, and essential ideas like the bias-variance trade-off, the No Free Lunch theorem, the curse of dimensionality, and the parsimony principle. We also walked through the practical workflow of machine learning projects, from data collection and preprocessing to model training, evaluation, tuning, and deployment.
It’s normal if some of this material feels abstract or unclear the first time. These ideas will become much more concrete in later chapters, where we will explore practical applications. Machine learning is both conceptual and practical, so understanding grows through experience, experimentation, and reflection.